Google's YAMNet, trained on AudioSet's large corpus of labeled audio, can distinguish hundreds of sound categories. The distinction that matters here is between speech (Speech, Narration, Conversation) and music (Music, Singing, Wind instrument). YAMNet produces classification scores for each audio frame, with a new frame starting roughly every 0.48 seconds; subtract the music score from the speech score, and you get a per-frame signal that's positive when someone is talking and negative when music is playing.
That's the core of cutAudio.py. The other script, mergeAudiobooks.py, handles the preliminary housekeeping: because most audiobook downloads arrive as a pile of individual chapters, it takes a folder of M4A chapter files and merges them into a single continuous file, preserving chapter markers and metadata.
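The chapter-preserving part of a merge like this can be sketched as follows. This is not the code from mergeAudiobooks.py; it is a minimal illustration that assumes each chapter's duration is already known (for example from ffprobe) and writes the chapter list in FFmpeg's FFMETADATA text format. The function names and the sample titles are hypothetical.

```python
def _escape(value):
    """Backslash-escape the characters FFMETADATA treats as special."""
    for ch in ("\\", "=", ";", "#"):
        value = value.replace(ch, "\\" + ch)
    return value.replace("\n", "\\\n")


def build_chapter_metadata(chapters):
    """Build an FFMETADATA1 document from (title, duration_seconds) pairs.

    Each chapter starts where the previous one ended, so the markers stay
    aligned with the audio once the files are concatenated in this order.
    """
    lines = [";FFMETADATA1"]
    start_ms = 0
    for title, duration_s in chapters:
        end_ms = start_ms + round(duration_s * 1000)
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",  # START and END are in milliseconds
            f"START={start_ms}",
            f"END={end_ms}",
            f"title={_escape(title)}",
        ]
        start_ms = end_ms
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical chapter titles and durations, for illustration only.
    print(build_chapter_metadata([("Chapter 1", 1805.2), ("Chapter 2", 1764.9)]))
```

One way to use the result is to save it to a file and pass it to ffmpeg as a second input next to the concatenated audio, selecting it with `-map_metadata 1 -map_chapters 1`. Whether the original script works this way is not stated.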
classification:
    model: YAMNet (TensorFlow Hub)
    frame_interval: 0.96 × 0.5 seconds
    speech_categories: [Speech, Narration, Conversation]
    music_categories: [Music, Singing, Wind instrument]
    score: speechMask - musicMask
    min_music_duration: 10 frames (~5 seconds)
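The `score: speechMask - musicMask` line can be illustrated with a small sketch. It assumes the per-frame class scores have already been obtained from the model, and it assumes the masked scores are combined by summing within each group; the source does not say how cutAudio.py aggregates them. The category labels are the ones listed in the config, and the model's own class map may spell them differently. The function name is hypothetical.

```python
SPEECH = {"Speech", "Narration", "Conversation"}
MUSIC = {"Music", "Singing", "Wind instrument"}


def speech_minus_music(frame_scores, class_names):
    """Return one value per frame: summed speech scores minus summed music scores.

    frame_scores: list of per-frame score lists, one score per class.
    class_names:  class labels in the same order as the score columns.
    """
    # The "masks" are just the column indices belonging to each group.
    speech_idx = [i for i, name in enumerate(class_names) if name in SPEECH]
    music_idx = [i for i, name in enumerate(class_names) if name in MUSIC]
    return [
        sum(row[i] for i in speech_idx) - sum(row[i] for i in music_idx)
        for row in frame_scores
    ]


if __name__ == "__main__":
    # Toy data: two frames, four classes (made-up scores).
    names = ["Speech", "Music", "Silence", "Singing"]
    scores = [[0.75, 0.25, 0.0, 0.0], [0.125, 0.5, 0.0, 0.25]]
    print(speech_minus_music(scores, names))
```

Positive values mean the speech classes dominate the frame, and negative values mean the music classes do.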
A minimum duration threshold of 10 frames (roughly 5 seconds) filters out false positives: a brief sound effect or a narrator with a momentarily melodic cadence won't trigger a cut.
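A minimal sketch of that filter, assuming a frame counts as music when its score falls below a threshold. The default threshold of 0.0 and the function name are assumptions for illustration, not values taken from cutAudio.py; the 10-frame minimum and the 0.48-second frame interval come from the config.

```python
def music_segments(scores, threshold=0.0, min_frames=10, hop_s=0.48):
    """Return (start_s, end_s) spans where score < threshold for at least min_frames frames."""
    segments = []
    run_start = None
    # A trailing +inf sentinel is never "music", so it closes a run that
    # reaches the end of the recording.
    for i, score in enumerate(list(scores) + [float("inf")]):
        is_music = score < threshold
        if is_music and run_start is None:
            run_start = i  # a new run of music frames begins
        elif not is_music and run_start is not None:
            if i - run_start >= min_frames:  # shorter runs are ignored
                segments.append((run_start * hop_s, i * hop_s))
            run_start = None
    return segments


if __name__ == "__main__":
    # 3 music frames (too short to count), then 12 music frames (long enough).
    demo = [1.0] * 5 + [-1.0] * 3 + [1.0] * 5 + [-1.0] * 12 + [1.0] * 2
    print(music_segments(demo))
```

Each returned span is a candidate for removal. The short three-frame dip in the demo is ignored, which is the behavior described above for brief sound effects.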
Threshold Tuning
The detection threshold has no universal correct value. I used MinMaxLTTBDownsampler to downsample the frame-by-frame scores so I could plot them across entire audiobooks. The plots showed where the transitions were clean (most of the time: music plays, stops, narrator speaks) and where they were ambiguous (music fading gradually under narration, or a narrator with sing-song delivery). Each audiobook has its own audio characteristics, but the defaults handle most cases without adjustment.
After removing the music segments, the audio passes through FFmpeg's dynaudnorm filter (frame length 250 ms, maximum gain factor 80, Gaussian window size 17, compression ratio 0.7) to smooth out the volume discontinuities at every cut point. Without normalization, each splice is audible as a sudden level change. Output is 32 kHz mono, encoded either to FLAC for archival or to AAC via libfdk_aac at 42 kbps for mobile listening (spoken word doesn't need a high bitrate).
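As a rough sketch, the normalization and encoding step could be assembled like this. The mapping of the numbers above onto dynaudnorm's `f`, `g`, `m` and `s` options is an assumption, particularly `s` for the compression setting; in FFmpeg, `m` is a gain factor rather than a decibel value. The function name is hypothetical, and `libfdk_aac` is only available in FFmpeg builds compiled with it.

```python
def build_ffmpeg_command(src, dst, codec="aac"):
    """Return an ffmpeg argument list that normalizes and re-encodes src into dst."""
    # f = frame length (ms), g = Gaussian window size, m = max gain factor,
    # s = compress factor (assumed to be the "compression" setting).
    audio_filter = "dynaudnorm=f=250:g=17:m=80:s=0.7"
    cmd = ["ffmpeg", "-i", src, "-af", audio_filter, "-ar", "32000", "-ac", "1"]
    if codec == "flac":
        cmd += ["-c:a", "flac"]  # lossless, for archival
    elif codec == "aac":
        cmd += ["-c:a", "libfdk_aac", "-b:a", "42k"]  # low bitrate for speech
    else:
        raise ValueError(f"unsupported codec: {codec}")
    return cmd + [dst]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to execute it.
    print(" ".join(build_ffmpeg_command("book_cut.flac", "book.m4a")))
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.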
Some audiobooks have sweeping orchestral chapter intros, mood music during dramatic passages, musical stingers between scenes. For me this is all pure distraction. I want words, just words, at a consistent volume. The tool has processed dozens of books now. Occasionally a brief music-to-speech transition gets clipped, but the output is clean enough that I've stopped noticing the edits.